{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "copyright-notice"
      },
      "source": [
        "#### Copyright 2017 Google LLC."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "both",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        },
        "colab_type": "code",
        "id": "copyright-notice2"
      },
      "outputs": [],
      "source": [
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "g4T-_IsVbweU"
      },
      "source": [
        " # Dispersión y regularización L1"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "g8ue2FyFIjnQ"
      },
      "source": [
        " **Objetivos de aprendizaje:**\n",
        "  * calcular el tamaño de un modelo\n",
        "  * aplicar regularización L1 para reducir el tamaño de un modelo al aumentar la dispersión"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "ME_WXE7cIjnS"
      },
      "source": [
        " Una forma de reducir la complejidad es usar una función de regularización que incentive que las ponderaciones sean de exactamente cero. Para los modelos lineales, como los de regresión, una ponderación de cero es equivalente a no usar el atributo correspondiente en absoluto. Además de evitar el sobreajuste, el modelo resultante será más eficaz.\n",
        "La regularización L1 es una buena forma de aumentar la dispersión.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "fHRzeWkRLrHF"
      },
      "source": [
        " ## Preparación\n",
        "\n",
        "Ejecuta las celdas a continuación para cargar los datos y crear definiciones de atributos."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        },
        "colab_type": "code",
        "id": "pb7rSrLKIjnS"
      },
      "outputs": [],
      "source": [
        "from __future__ import print_function\n",
        "\n",
        "import math\n",
        "\n",
        "from IPython import display\n",
        "from matplotlib import cm\n",
        "from matplotlib import gridspec\n",
        "from matplotlib import pyplot as plt\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "from sklearn import metrics\n",
        "import tensorflow as tf\n",
        "from tensorflow.python.data import Dataset\n",
        "\n",
        "tf.logging.set_verbosity(tf.logging.ERROR)\n",
        "pd.options.display.max_rows = 10\n",
        "pd.options.display.float_format = '{:.1f}'.format\n",
        "\n",
        "california_housing_dataframe = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv\", sep=\",\")\n",
        "\n",
        "california_housing_dataframe = california_housing_dataframe.reindex(\n",
        "    np.random.permutation(california_housing_dataframe.index))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        },
        "colab_type": "code",
        "id": "3V7q8jk0IjnW"
      },
      "outputs": [],
      "source": [
        "def preprocess_features(california_housing_dataframe):\n",
        "  \"\"\"Prepares input features from California housing data set.\n",
        "\n",
        "  Args:\n",
        "    california_housing_dataframe: A Pandas DataFrame expected to contain data\n",
        "      from the California housing data set.\n",
        "  Returns:\n",
        "    A DataFrame that contains the features to be used for the model, including\n",
        "    synthetic features.\n",
        "  \"\"\"\n",
        "  selected_features = california_housing_dataframe[\n",
        "    [\"latitude\",\n",
        "     \"longitude\",\n",
        "     \"housing_median_age\",\n",
        "     \"total_rooms\",\n",
        "     \"total_bedrooms\",\n",
        "     \"population\",\n",
        "     \"households\",\n",
        "     \"median_income\"]]\n",
        "  processed_features = selected_features.copy()\n",
        "  # Create a synthetic feature.\n",
        "  processed_features[\"rooms_per_person\"] = (\n",
        "    california_housing_dataframe[\"total_rooms\"] /\n",
        "    california_housing_dataframe[\"population\"])\n",
        "  return processed_features\n",
        "\n",
        "def preprocess_targets(california_housing_dataframe):\n",
        "  \"\"\"Prepares target features (i.e., labels) from California housing data set.\n",
        "\n",
        "  Args:\n",
        "    california_housing_dataframe: A Pandas DataFrame expected to contain data\n",
        "      from the California housing data set.\n",
        "  Returns:\n",
        "    A DataFrame that contains the target feature.\n",
        "  \"\"\"\n",
        "  output_targets = pd.DataFrame()\n",
        "  # Create a boolean categorical feature representing whether the\n",
        "  # median_house_value is above a set threshold.\n",
        "  output_targets[\"median_house_value_is_high\"] = (\n",
        "    california_housing_dataframe[\"median_house_value\"] \u003e 265000).astype(float)\n",
        "  return output_targets"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        },
        "colab_type": "code",
        "id": "pAG3tmgwIjnY"
      },
      "outputs": [],
      "source": [
        "# Choose the first 12000 (out of 17000) examples for training.\n",
        "training_examples = preprocess_features(california_housing_dataframe.head(12000))\n",
        "training_targets = preprocess_targets(california_housing_dataframe.head(12000))\n",
        "\n",
        "# Choose the last 5000 (out of 17000) examples for validation.\n",
        "validation_examples = preprocess_features(california_housing_dataframe.tail(5000))\n",
        "validation_targets = preprocess_targets(california_housing_dataframe.tail(5000))\n",
        "\n",
        "# Double-check that we've done the right thing.\n",
        "print(\"Training examples summary:\")\n",
        "display.display(training_examples.describe())\n",
        "print(\"Validation examples summary:\")\n",
        "display.display(validation_examples.describe())\n",
        "\n",
        "print(\"Training targets summary:\")\n",
        "display.display(training_targets.describe())\n",
        "print(\"Validation targets summary:\")\n",
        "display.display(validation_targets.describe())"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        },
        "colab_type": "code",
        "id": "gHkniRI1Ijna"
      },
      "outputs": [],
      "source": [
        "def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):\n",
        "    \"\"\"Trains a linear regression model.\n",
        "  \n",
        "    Args:\n",
        "      features: pandas DataFrame of features\n",
        "      targets: pandas DataFrame of targets\n",
        "      batch_size: Size of batches to be passed to the model\n",
        "      shuffle: True or False. Whether to shuffle the data.\n",
        "      num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely\n",
        "    Returns:\n",
        "      Tuple of (features, labels) for next data batch\n",
        "    \"\"\"\n",
        "  \n",
        "    # Convert pandas data into a dict of np arrays.\n",
        "    features = {key:np.array(value) for key,value in dict(features).items()}                                            \n",
        " \n",
        "    # Construct a dataset, and configure batching/repeating.\n",
        "    ds = Dataset.from_tensor_slices((features,targets)) # warning: 2GB limit\n",
        "    ds = ds.batch(batch_size).repeat(num_epochs)\n",
        "    \n",
        "    # Shuffle the data, if specified.\n",
        "    if shuffle:\n",
        "      ds = ds.shuffle(10000)\n",
        "    \n",
        "    # Return the next batch of data.\n",
        "    features, labels = ds.make_one_shot_iterator().get_next()\n",
        "    return features, labels"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        },
        "colab_type": "code",
        "id": "bLzK72jkNJPf"
      },
      "outputs": [],
      "source": [
        "def get_quantile_based_buckets(feature_values, num_buckets):\n",
        "  quantiles = feature_values.quantile(\n",
        "    [(i+1.)/(num_buckets + 1.) for i in range(num_buckets)])\n",
        "  return [quantiles[q] for q in quantiles.keys()]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        },
        "colab_type": "code",
        "id": "al2YQpKyIjnd"
      },
      "outputs": [],
      "source": [
        "def construct_feature_columns():\n",
        "  \"\"\"Construct the TensorFlow Feature Columns.\n",
        "\n",
        "  Returns:\n",
        "    A set of feature columns\n",
        "  \"\"\"\n",
        "\n",
        "  bucketized_households = tf.feature_column.bucketized_column(\n",
        "    tf.feature_column.numeric_column(\"households\"),\n",
        "    boundaries=get_quantile_based_buckets(training_examples[\"households\"], 10))\n",
        "  bucketized_longitude = tf.feature_column.bucketized_column(\n",
        "    tf.feature_column.numeric_column(\"longitude\"),\n",
        "    boundaries=get_quantile_based_buckets(training_examples[\"longitude\"], 50))\n",
        "  bucketized_latitude = tf.feature_column.bucketized_column(\n",
        "    tf.feature_column.numeric_column(\"latitude\"),\n",
        "    boundaries=get_quantile_based_buckets(training_examples[\"latitude\"], 50))\n",
        "  bucketized_housing_median_age = tf.feature_column.bucketized_column(\n",
        "    tf.feature_column.numeric_column(\"housing_median_age\"),\n",
        "    boundaries=get_quantile_based_buckets(\n",
        "      training_examples[\"housing_median_age\"], 10))\n",
        "  bucketized_total_rooms = tf.feature_column.bucketized_column(\n",
        "    tf.feature_column.numeric_column(\"total_rooms\"),\n",
        "    boundaries=get_quantile_based_buckets(training_examples[\"total_rooms\"], 10))\n",
        "  bucketized_total_bedrooms = tf.feature_column.bucketized_column(\n",
        "    tf.feature_column.numeric_column(\"total_bedrooms\"),\n",
        "    boundaries=get_quantile_based_buckets(training_examples[\"total_bedrooms\"], 10))\n",
        "  bucketized_population = tf.feature_column.bucketized_column(\n",
        "    tf.feature_column.numeric_column(\"population\"),\n",
        "    boundaries=get_quantile_based_buckets(training_examples[\"population\"], 10))\n",
        "  bucketized_median_income = tf.feature_column.bucketized_column(\n",
        "    tf.feature_column.numeric_column(\"median_income\"),\n",
        "    boundaries=get_quantile_based_buckets(training_examples[\"median_income\"], 10))\n",
        "  bucketized_rooms_per_person = tf.feature_column.bucketized_column(\n",
        "    tf.feature_column.numeric_column(\"rooms_per_person\"),\n",
        "    boundaries=get_quantile_based_buckets(\n",
        "      training_examples[\"rooms_per_person\"], 10))\n",
        "\n",
        "  long_x_lat = tf.feature_column.crossed_column(\n",
        "    set([bucketized_longitude, bucketized_latitude]), hash_bucket_size=1000)\n",
        "\n",
        "  feature_columns = set([\n",
        "    long_x_lat,\n",
        "    bucketized_longitude,\n",
        "    bucketized_latitude,\n",
        "    bucketized_housing_median_age,\n",
        "    bucketized_total_rooms,\n",
        "    bucketized_total_bedrooms,\n",
        "    bucketized_population,\n",
        "    bucketized_households,\n",
        "    bucketized_median_income,\n",
        "    bucketized_rooms_per_person])\n",
        "  \n",
        "  return feature_columns"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "hSBwMrsrE21n"
      },
      "source": [
        " ## Cálculo del tamaño del modelo\n",
        "\n",
        "Para calcular el tamaño del modelo, simplemente contamos la cantidad de parámetros que no son cero. Más abajo se incluye una función de ayuda para hacer esto. La función usa conocimientos profundos de la API de Estimators; no te preocupes por comprender cómo funciona."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        },
        "colab_type": "code",
        "id": "e6GfTI0CFhB8"
      },
      "outputs": [],
      "source": [
        "def model_size(estimator):\n",
        "  variables = estimator.get_variable_names()\n",
        "  size = 0\n",
        "  for variable in variables:\n",
        "    if not any(x in variable \n",
        "               for x in ['global_step',\n",
        "                         'centered_bias_weight',\n",
        "                         'bias_weight',\n",
        "                         'Ftrl']\n",
        "              ):\n",
        "      size += np.count_nonzero(estimator.get_variable_value(variable))\n",
        "  return size"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "XabdAaj67GfF"
      },
      "source": [
        " ## Reducción del tamaño del modelo\n",
        "\n",
        "Tu equipo necesita desarrollar un modelo de regresión logística muy preciso sobre el *SmartRing*, un anillo que es tan inteligente que puede detectar la demografía de una manzana (`median_income`, `avg_rooms`, `households`, etc.) e indicarte si la manzana específica es de costo elevado o no.\n",
        "\n",
        "Dado que el SmartRing es pequeño, el equipo de ingeniería determinó que solo puede abordar un modelo que tenga **no más de 600 parámetros**. Por otro lado, el equipo de administración de productos determinó que el modelo no se puede lanzar a menos que la **pérdida logística sea menor que 0.35** en el conjunto de prueba de exclusión.\n",
        "\n",
        "¿Puedes usar tu arma secreta, la regularización L1, para ajustar el modelo para satisfacer las restricciones de tamaño y exactitud?"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "G79hGRe7qqej"
      },
      "source": [
        " ### Tarea 1: Buscar un buen coeficiente de regularización\n",
        "\n",
        "**Busca un parámetro de potencia de regularización L1 que satisfaga ambas limitaciones, que el tamaño del modelo sea menor que 600 y que la pérdida logística sea menor que 0.35 en el conjunto de validación.**\n",
        "\n",
        "El código que aparece a continuación te ayudará a comenzar. Hay muchas formas de aplicar la regularización al modelo. Aquí, elegimos hacerlo con la función `FtrlOptimizer`, que está diseñada para dar mejores resultados con la regularización L1 que con el descenso de gradientes estándar.\n",
        "\n",
        "Nuevamente, el modelo se entrenará con el conjunto de datos completo, de manera que esperamos que se ejecute más lentamente que lo normal."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        },
        "colab_type": "code",
        "id": "1Fcdm0hpIjnl"
      },
      "outputs": [],
      "source": [
        "def train_linear_classifier_model(\n",
        "    learning_rate,\n",
        "    regularization_strength,\n",
        "    steps,\n",
        "    batch_size,\n",
        "    feature_columns,\n",
        "    training_examples,\n",
        "    training_targets,\n",
        "    validation_examples,\n",
        "    validation_targets):\n",
        "  \"\"\"Trains a linear regression model.\n",
        "  \n",
        "  In addition to training, this function also prints training progress information,\n",
        "  as well as a plot of the training and validation loss over time.\n",
        "  \n",
        "  Args:\n",
        "    learning_rate: A `float`, the learning rate.\n",
        "    regularization_strength: A `float` that indicates the strength of the L1\n",
        "       regularization. A value of `0.0` means no regularization.\n",
        "    steps: A non-zero `int`, the total number of training steps. A training step\n",
        "      consists of a forward and backward pass using a single batch.\n",
        "    feature_columns: A `set` specifying the input feature columns to use.\n",
        "    training_examples: A `DataFrame` containing one or more columns from\n",
        "      `california_housing_dataframe` to use as input features for training.\n",
        "    training_targets: A `DataFrame` containing exactly one column from\n",
        "      `california_housing_dataframe` to use as target for training.\n",
        "    validation_examples: A `DataFrame` containing one or more columns from\n",
        "      `california_housing_dataframe` to use as input features for validation.\n",
        "    validation_targets: A `DataFrame` containing exactly one column from\n",
        "      `california_housing_dataframe` to use as target for validation.\n",
        "      \n",
        "  Returns:\n",
        "    A `LinearClassifier` object trained on the training data.\n",
        "  \"\"\"\n",
        "\n",
        "  periods = 7\n",
        "  steps_per_period = steps / periods\n",
        "\n",
        "  # Create a linear classifier object.\n",
        "  my_optimizer = tf.train.FtrlOptimizer(learning_rate=learning_rate, l1_regularization_strength=regularization_strength)\n",
        "  my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)\n",
        "  linear_classifier = tf.estimator.LinearClassifier(\n",
        "      feature_columns=feature_columns,\n",
        "      optimizer=my_optimizer\n",
        "  )\n",
        "  \n",
        "  # Create input functions.\n",
        "  training_input_fn = lambda: my_input_fn(training_examples, \n",
        "                                          training_targets[\"median_house_value_is_high\"], \n",
        "                                          batch_size=batch_size)\n",
        "  predict_training_input_fn = lambda: my_input_fn(training_examples, \n",
        "                                                  training_targets[\"median_house_value_is_high\"], \n",
        "                                                  num_epochs=1, \n",
        "                                                  shuffle=False)\n",
        "  predict_validation_input_fn = lambda: my_input_fn(validation_examples, \n",
        "                                                    validation_targets[\"median_house_value_is_high\"], \n",
        "                                                    num_epochs=1, \n",
        "                                                    shuffle=False)\n",
        "  \n",
        "  # Train the model, but do so inside a loop so that we can periodically assess\n",
        "  # loss metrics.\n",
        "  print(\"Training model...\")\n",
        "  print(\"LogLoss (on validation data):\")\n",
        "  training_log_losses = []\n",
        "  validation_log_losses = []\n",
        "  for period in range (0, periods):\n",
        "    # Train the model, starting from the prior state.\n",
        "    linear_classifier.train(\n",
        "        input_fn=training_input_fn,\n",
        "        steps=steps_per_period\n",
        "    )\n",
        "    # Take a break and compute predictions.\n",
        "    training_probabilities = linear_classifier.predict(input_fn=predict_training_input_fn)\n",
        "    training_probabilities = np.array([item['probabilities'] for item in training_probabilities])\n",
        "    \n",
        "    validation_probabilities = linear_classifier.predict(input_fn=predict_validation_input_fn)\n",
        "    validation_probabilities = np.array([item['probabilities'] for item in validation_probabilities])\n",
        "    \n",
        "    # Compute training and validation loss.\n",
        "    training_log_loss = metrics.log_loss(training_targets, training_probabilities)\n",
        "    validation_log_loss = metrics.log_loss(validation_targets, validation_probabilities)\n",
        "    # Occasionally print the current loss.\n",
        "    print(\"  period %02d : %0.2f\" % (period, validation_log_loss))\n",
        "    # Add the loss metrics from this period to our list.\n",
        "    training_log_losses.append(training_log_loss)\n",
        "    validation_log_losses.append(validation_log_loss)\n",
        "  print(\"Model training finished.\")\n",
        "\n",
        "  # Output a graph of loss metrics over periods.\n",
        "  plt.ylabel(\"LogLoss\")\n",
        "  plt.xlabel(\"Periods\")\n",
        "  plt.title(\"LogLoss vs. Periods\")\n",
        "  plt.tight_layout()\n",
        "  plt.plot(training_log_losses, label=\"training\")\n",
        "  plt.plot(validation_log_losses, label=\"validation\")\n",
        "  plt.legend()\n",
        "\n",
        "  return linear_classifier"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        },
        "colab_type": "code",
        "id": "9H1CKHSzIjno"
      },
      "outputs": [],
      "source": [
        "linear_classifier = train_linear_classifier_model(\n",
        "    learning_rate=0.1,\n",
        "    # TWEAK THE REGULARIZATION VALUE BELOW\n",
        "    regularization_strength=0.0,\n",
        "    steps=300,\n",
        "    batch_size=100,\n",
        "    feature_columns=construct_feature_columns(),\n",
        "    training_examples=training_examples,\n",
        "    training_targets=training_targets,\n",
        "    validation_examples=validation_examples,\n",
        "    validation_targets=validation_targets)\n",
        "print(\"Model size:\", model_size(linear_classifier))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "yjUCX5LAkxAX"
      },
      "source": [
        " ### Solución\n",
        "\n",
        "Haz clic más abajo para ver una solución posible."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "hgGhy-okmkWL"
      },
      "source": [
        " Una potencia de regularización de 0.1 debería ser suficiente. Ten en cuenta que se debe lograr un equilibrio:\n",
        "la regularización más potente nos da modelos más pequeños, pero puede afectar la pérdida de clasificación."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        },
        "colab_type": "code",
        "id": "_rV8YQWZIjns"
      },
      "outputs": [],
      "source": [
        "linear_classifier = train_linear_classifier_model(\n",
        "    learning_rate=0.1,\n",
        "    regularization_strength=0.1,\n",
        "    steps=300,\n",
        "    batch_size=100,\n",
        "    feature_columns=construct_feature_columns(),\n",
        "    training_examples=training_examples,\n",
        "    training_targets=training_targets,\n",
        "    validation_examples=validation_examples,\n",
        "    validation_targets=validation_targets)\n",
        "print(\"Model size:\", model_size(linear_classifier))"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [
        "copyright-notice",
        "yjUCX5LAkxAX"
      ],
      "default_view": {},
      "name": "sparsity_and_l1_regularization.ipynb",
      "provenance": [],
      "version": "0.3.2",
      "views": {}
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
